Multi-modal image-text models such as CLIP and LiT have demonstrated impressive performance on image classification benchmarks and their zero-shot generalization ability is particularly exciting. While the top-5 zero-shot accuracies of these models are very high, the top-1 accuracies are much lower (over 25% gap in some cases). We investigate the reasons for this performance gap and find that many of the failure cases are caused by ambiguity in the text prompts. First, we develop a simple and efficient zero-shot post-hoc method to identify images whose top-1 prediction is likely to be incorrect, by measuring consistency of the predictions w.r.t. multiple prompts and image transformations. We show that our procedure better predicts mistakes, outperforming the popular max logit baseline on selective prediction tasks. Next, we propose a simple and efficient way to improve accuracy on such uncertain images by making use of the WordNet hierarchy; specifically we augment the original class by incorporating its parent and children from the semantic label hierarchy, and plug the augmentation into text promts. We conduct experiments on both CLIP and LiT models with five different ImageNet-based datasets. For CLIP, our method improves the top-1 accuracy by 17.13% on the uncertain subset and 3.6% on the entire ImageNet validation set. We also show that our method improves across ImageNet shifted datasets and other model architectures such as LiT. Our proposed method is hyperparameter-free, requires no additional model training and can be easily scaled to other large multi-modal architectures.
translated by 谷歌翻译
我们提出了一个新的基准数据集,即Sapsucker Woods 60(SSW60),用于推进视听细颗粒分类的研究。尽管我们的社区在图像上的细粒度视觉分类方面取得了长足的进步,但音频和视频细颗粒分类的对应物相对尚未探索。为了鼓励在这个领域的进步,我们已经仔细构建了SSW60数据集,以使研究人员能够以三种不同的方式对相同的类别进行分类:图像,音频和视频。该数据集涵盖了60种鸟类,由现有数据集以及全新的专家策划音频和视频数据集组成。我们通过使用最先进的变压器方法进行了彻底基准的视听分类性能和模态融合实验。我们的发现表明,视听融合方法的性能要比仅使用基于图像或音频的方法来进行视频分类任务要好。我们还提出了有趣的模态转移实验,这是由SSW60的独特构造所涵盖的三种不同模态所实现的。我们希望SSW60数据集和伴随的基线在这个迷人的地区进行研究。
translated by 谷歌翻译
我们提出了聚类蒙版变压器(CMT-DeepLab),这是一种基于变压器的框架,用于围绕聚类设计的泛型分割。它重新考虑了用于分割和检测的现有变压器架构;CMT-DeepLab认为对象查询是群集中心,该中心填充了应用于分割时将像素分组的作用。群集通过交替的过程计算,首先通过其功能亲和力将像素分配给簇,然后更新集群中心和像素功能。这些操作共同包含聚类蒙版变压器(CMT)层,该层产生了越野器的交叉注意,并且与最终的分割任务更加一致。CMT-DeepLab在可可Test-DEV集中实现了55.7%的PQ的新最先进的PQ,可显着提高先前ART的性能。
translated by 谷歌翻译
现代自我监督的学习算法通常强制执行跨视图实例的表示的持久性。虽然非常有效地学习整体图像和视频表示,但这种方法成为在视频中学习时空时间细粒度的特征的子最优,其中场景和情况通过空间和时间演变。在本文中,我们介绍了上下文化的时空对比学习(Const-CL)框架,以利用自我监督有效学习时空时间细粒度的表示。我们首先设计一种基于区域的自我监督的借口任务,该任务要求模型从一个视图中学习将实例表示转换为上下文特征的另一个视图。此外,我们介绍了一个简单的网络设计,有效地调和了整体和本地表示的同时学习过程。我们评估我们对各种下游任务和CONST-CL的学习表现,实现了四个数据集的最先进结果。对于时空行动本地化,Const-CL可以使用AVA-Kinetics验证集的检测到框实现39.4%的地图和30.5%地图。对于对象跟踪,Const-CL在OTB2015上实现了78.1%的精度和55.2%的成功分数。此外,Const-CL分别在视频动作识别数据集,UCF101和HMDB51上实现了94.8%和71.9%的前1个微调精度。我们计划向公众发布我们的代码和模型。
translated by 谷歌翻译
这项工作提出了一个名为TEG的自我监督的学习框架,探讨学习视频表示中的时间粒度。在TEG中,我们从视频中抽出一个长剪辑,以及在长夹内部的短夹。然后我们提取密集的时间嵌入品。培训目标由两部分组成:一个细粒度的时间学习目的,以最大化短夹和长剪辑中的相应时间嵌入之间的相似性,以及持续的时间学习目标,以将两个剪辑的全局嵌入在一起。我们的研究揭示了时间粒度与三个主要发现的影响。 1)不同的视频任务可能需要不同时间粒度的特征。 2)有趣的是,广泛认为需要时间感知的一些任务实际上可以通过时间持久的功能来解决。 3)TEG的灵活性对8个视频基准测试产生最先进的结果,在大多数情况下优于监督预训练。
translated by 谷歌翻译
为视频中的每个像素分配语义类和跟踪身份的任务称为视频Panoptic分段。我们的工作是第一个在真实世界中瞄准这项任务,需要在空间和时间域中的密集解释。由于此任务的地面真理难以获得,但是,现有数据集是合成构造的或仅在短视频剪辑中稀疏地注释。为了克服这一点,我们介绍了一个包含两个数据集,Kitti-Step和Motchallenge步骤的新基准。数据集包含长视频序列,提供具有挑战性的示例和用于研究长期像素精确分割和在真实条件下跟踪的测试床。我们进一步提出了一种新的评估度量分割和跟踪质量(STQ),其相当余额平衡该任务的语义和跟踪方面,并且更适合评估任意长度的序列。最后,我们提供了几个基线来评估此新具有挑战性数据集的现有方法的状态。我们已将我们的数据集,公制,基准服务器和基准公开提供,并希望这将激发未来的研究。
translated by 谷歌翻译
对人类姿势和行动的认可对于自治系统与人们顺利互动。然而,相机通常在2D中捕获人类的姿势,作为图像和视频,这在跨越识别任务具有挑战性的观点来具有显着的外观变化。为了解决这个问题,我们探讨了来自2D信息的3D人体姿势中的识别相似性,在现有工作中没有得到很好地研究。在这里,我们提出了一种从2D主体关节键盘学习紧凑型视图 - 不变的嵌入空间的方法,而不明确地预测3D姿势。通过确定性映射难以代表预测和遮挡的2D姿势的输入模糊,因此我们采用了嵌入空间的概率制定。实验结果表明,与3D姿态估计模型相比,我们的嵌入模型在不同相机视图中检索类似的姿势时达到更高的准确性。我们还表明,通过培训简单的时间嵌入模型,我们在姿势序列检索方面取得了卓越的性能,并大大减少了基于堆叠帧的嵌入式的嵌入维度,以实现高效的大规模检索。此外,为了使我们的嵌入能够使用部分可见的输入,我们进一步调查培训期间的不同关键点遮挡增强策略。我们证明这些遮挡增强显着提高了部分2D输入姿势的检索性能。行动识别和视频对齐的结果表明,使用我们的嵌入没有任何额外培训,可以实现相对于每个任务专门培训的其他模型的竞争性能。
translated by 谷歌翻译
In this work, we introduce Panoptic-DeepLab, a simple, strong, and fast system for panoptic segmentation, aiming to establish a solid baseline for bottom-up methods that can achieve comparable performance of two-stage methods while yielding fast inference speed. In particular, Panoptic-DeepLab adopts the dual-ASPP and dual-decoder structures specific to semantic, and instance segmentation, respectively. The semantic segmentation branch is the same as the typical design of any semantic segmentation model (e.g., DeepLab), while the instance segmentation branch is class-agnostic, involving a simple instance center regression. As a result, our single Panoptic-DeepLab simultaneously ranks first at all three Cityscapes benchmarks, setting the new state-of-art of 84.2% mIoU, 39.0% AP, and 65.5% PQ on test set. Additionally, equipped with MobileNetV3, Panoptic-DeepLab runs nearly in real-time with a single 1025 × 2049 image (15.8 frames per second), while achieving a competitive performance on Cityscapes (54.1 PQ% on test set). On Mapillary Vistas test set, our ensemble of six models attains 42.7% PQ, outperforming the challenge winner in 2018 by a healthy margin of 1.5%. Finally, our Panoptic-DeepLab also performs on par with several topdown approaches on the challenging COCO dataset. For the first time, we demonstrate a bottom-up approach could deliver state-of-the-art results on panoptic segmentation.
translated by 谷歌翻译
We present the next generation of MobileNets based on a combination of complementary search techniques as well as a novel architecture design. MobileNetV3 is tuned to mobile phone CPUs through a combination of hardwareaware network architecture search (NAS) complemented by the NetAdapt algorithm and then subsequently improved through novel architecture advances. This paper starts the exploration of how automated search algorithms and network design can work together to harness complementary approaches improving the overall state of the art. Through this process we create two new MobileNet models for release: MobileNetV3-Large and MobileNetV3-Small which are targeted for high and low resource use cases. These models are then adapted and applied to the tasks of object detection and semantic segmentation. For the task of semantic segmentation (or any dense pixel prediction), we propose a new efficient segmentation decoder Lite Reduced Atrous Spatial Pyramid Pooling (LR-ASPP). We achieve new state of the art results for mobile classification, detection and segmentation. MobileNetV3-Large is 3.2% more accurate on ImageNet classification while reducing latency by 20% compared to MobileNetV2. MobileNetV3-Small is 6.6% more accurate compared to a MobileNetV2 model with comparable latency. MobileNetV3-Large detection is over 25% faster at roughly the same accuracy as Mo-bileNetV2 on COCO detection. MobileNetV3-Large LR-ASPP is 34% faster than MobileNetV2 R-ASPP at similar accuracy for Cityscapes segmentation.
translated by 谷歌翻译
Many of the recent successful methods for video object segmentation (VOS) are overly complicated, heavily rely on fine-tuning on the first frame, and/or are slow, and are hence of limited practical use. In this work, we propose FEELVOS as a simple and fast method which does not rely on fine-tuning. In order to segment a video, for each frame FEELVOS uses a semantic pixel-wise embedding together with a global and a local matching mechanism to transfer information from the first frame and from the previous frame of the video to the current frame. In contrast to previous work, our embedding is only used as an internal guidance of a convolutional network. Our novel dynamic segmentation head allows us to train the network, including the embedding, end-to-end for the multiple object segmentation task with a cross entropy loss. We achieve a new state of the art in video object segmentation without fine-tuning with a J &F measure of 71.5% on the DAVIS 2017 validation set. We make our code and models available at https://github.com/tensorflow/ models/tree/master/research/feelvos.
translated by 谷歌翻译